autodb: splitting up tables automatically
2025-07-03
ChickWeight (\(578 \times 4\))
| 0 |
1 |
42 |
1 |
| 0 |
2 |
40 |
1 |
| 0 |
3 |
43 |
1 |
| 0 |
4 |
42 |
1 |
| 2 |
1 |
51 |
1 |
| 2 |
2 |
49 |
1 |
| 2 |
3 |
39 |
1 |
| 2 |
4 |
49 |
1 |
ChickWeight (\(578 \times 3\), \(50 \times 2\))
| 0 |
1 |
42 |
| 0 |
2 |
40 |
| 0 |
3 |
43 |
| 0 |
4 |
42 |
| 2 |
1 |
51 |
| 2 |
2 |
49 |
| 2 |
3 |
39 |
| 2 |
4 |
49 |
| 1 |
1 |
| 2 |
1 |
| 3 |
1 |
| 4 |
1 |
| 5 |
1 |
| 6 |
1 |
| 7 |
1 |
| 8 |
1 |
Database normalisation
Tidy data
In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
This is Codd’s 3rd normal form [i.e. database normalisation], but with the constraints framed in statistical language…”
~ Wickham, Tidy data (2014)
What if we don’t know the structure?
No expected structure, or the data’s actual structure is different:
- No documentation
- Faulty documentation
- Data entry errors
autodb
autodb turns a table into a database:
chick_db <- autodb(ChickWeight)
database with 2 relations
4 attributes: weight, Time, Chick, Diet
relation Chick: Chick, Diet; 50 records
key 1: Chick
relation Time_Chick: Time, Chick, weight; 578 records
key 1: Time, Chick
references:
Time_Chick.{Chick} -> Chick.{Chick}
Quick enough for interactive use for small data sets:
Unit: milliseconds
min lq mean median uq max neval
15.9692 18.6487 19.74904 19.4545 20.8726 27.5227 100
Plotting with GraphViz
chick_code <- gv(chick_db)
DiagrammeR::grViz(chick_code)
What we don’t see
What we don’t see
Finding structure-breaking errors
How it works: FDs
Functional dependencies (FDs):
\(X \rightarrow y\) (\(X\) determines \(y\))
For any pair of rows, if their values in variables \(X\) are equal, then their values in variable \(y\) are equal
\(y := f(X)\) for some (unknown) \(f\)
How it works: FDs to schema
Bernstein synthesis, 1976
Synthesizing third normal form relations from functional dependencies. ACM Trans. Database Syst.
How it works: FD search
Generate FD search space: pick a variable for \(y\), pick a subset of the rest for \(X\)
Check each FD: every pair that matches on \(X\) matches on \(y\)
Search space + validation = automatic search!
Data with \(n\) variables has \(n 2^{n-1}\) possible FDs
How it works: FD search
FDHits algorithm, 2024
Discovering functional dependencies through hitting set enumeration. Proc. ACM Manag. Data
What did I do?
Implementation in R (and lots of tests)
Diagrams (esp. for multiple keys)
Some search customisation (e.g. limiting what can be in \(X\))
Databases are vector-like
names(chick_db) <- c("Chick", "Measurement")
Databases are vector-like
Databases are vector-like
chick_db[c(1, 1, 2, 2, 2)]
Databases are vector-like
x <- db_simple
x$title <- chick_db$Chick
Databases are vector-like
data.frame(
FD = discover(ChickWeight),
relvar = autodb(ChickWeight)
)
| Chick -> Diet |
relation Chick (50 records) |
| Time, Chick -> weight |
relation Time_Chick (578 records) |
The end
autodb is on CRAN
Future plans: better performance, normalising to BCNF, splitting to remove missing values
References:
- Bernstein P. A. (1976) Synthesizing third normal form relations from functional dependencies
- Bleifuss et al. (2024) Discovering functional dependencies through hitting set enumeration
- Other data profiling research by the Information Systems Group at Hasso-Plattner-Institut
Speed
| data |
nrow |
ncol |
#FDs |
FDs |
DB |
| ChickWeight |
578 |
4 |
2 |
0.01 |
0.02 |
| planes |
3322 |
9 |
24 |
0.16 |
0.52 |
| nudge |
447 |
25 |
3732 |
3.05 |
18.25 |
| liquor |
1020561 |
24 |
1637 |
1274.50 |
3188.39 |
NAs: the standard-practice cheat
| 1 |
2022-05-02 |
NA |
| 2 |
2022-06-06 |
NA |
| 3 |
2022-04-01 |
2022-10-07 |
| 4 |
2022-03-19 |
NA |
Equality to NA is NA, not true or false.
This breaks functional dependencies.
Standard fix: pretend they’re non-missing and equal.
NAs: the standard-practice cheat
| 1 |
2022-05-02 |
NA |
| 2 |
2022-06-06 |
NA |
| 3 |
2022-04-01 |
2022-10-07 |
| 4 |
2022-03-19 |
NA |
NAs: the ideal (on the to-do list)
BCNF
2 functional dependencies
3 attributes: a, b, c
a, b -> c
c -> a
BCNF
2 functional dependencies
3 attributes: a, b, c
a, b -> c
c -> a
BCNF
2 functional dependencies
3 attributes: a, b, c
a, b -> c
c -> a
Python: alteryx/autonormalize